686 research outputs found

    Induction of integrated view for XML data with heterogeneous DTDs

    Finding Related Publications: Extending the Set of Terms Used to Assess Article Similarity.

    Recommending related articles is an important feature of PubMed. The PubMed Related Citations (PRC) algorithm is the engine behind this feature, and it leverages information on 22 million citations. We analyzed the performance of the PRC algorithm on 4584 annotated articles from the 2005 Text REtrieval Conference (TREC) Genomics Track data. Our analysis indicated that the term weighted highest by PRC was not always the critical term most directly related to the topic of the article. We implemented term expansion and found it to be a promising, easy-to-implement way to improve the performance of the PRC algorithm on the TREC 2005 Genomics data and on the TREC 2014 Clinical Decision Support Track data. For term expansion, we trained a Skip-gram model using the Word2Vec package. The extended PRC algorithm achieved higher average precision for a large subset of articles. A combination of both algorithms may improve related-article recommendations.
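The term-expansion idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says a Skip-gram model was trained with Word2Vec, whereas the vectors below are invented toy values, and the names `TOY_VECTORS`, `expand_terms`, `top_n`, and `min_similarity` are hypothetical. The sketch only shows the general pattern of adding each term's nearest neighbours in embedding space to the set of terms used for similarity scoring.

```python
import math

# Invented 3-dimensional vectors for illustration only. A real system would
# load embeddings from a trained Skip-gram model instead.
TOY_VECTORS = {
    "tumor": (0.9, 0.1, 0.0),
    "neoplasm": (0.85, 0.15, 0.05),
    "cancer": (0.8, 0.2, 0.1),
    "splicing": (0.0, 0.9, 0.3),
    "typhoon": (0.1, 0.0, 0.95),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_terms(terms, vectors, top_n=2, min_similarity=0.9):
    """Return the input terms followed by their closest neighbours.

    For each known term, up to top_n neighbours whose cosine similarity is
    at least min_similarity are appended. Unknown terms are kept unchanged.
    """
    expanded = list(terms)
    for term in terms:
        if term not in vectors:
            continue
        scored = [
            (cosine(vectors[term], vec), other)
            for other, vec in vectors.items()
            if other != term
        ]
        scored.sort(reverse=True)  # most similar first
        for similarity, other in scored[:top_n]:
            if similarity >= min_similarity and other not in expanded:
                expanded.append(other)
    return expanded


print(expand_terms(["tumor"], TOY_VECTORS))
```

With the toy vectors, "tumor" gains "neoplasm" and "cancer" while unrelated words stay out because of the similarity threshold. How the expanded terms are weighted inside the PRC score is not described in the abstract and is left out here.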

    Sequence features involved in the mechanism of 3' splice junction wobbling

    Background: Alternative splicing is an important mechanism mediating the diversified functions of genes in multicellular organisms, and it occurs in around 40-60% of human genes. Recently, a new splice-junction wobbling mechanism was proposed, in which subtle modifications arise during mRNA maturation through alternative splice site choice at 5'-GTNGT and 3'-NAGNAG motifs, creating single-amino-acid insertion and deletion isoforms. Results: By browsing the Alternative Splicing Database, we observed that most 3' alternative splice site choices occur within six nucleotides of the dominant splice site, and that the incidence decreases significantly farther away from the dominant acceptor site. Although alternative splicing occurs less frequently within the intronic region (alternative splicing at the proximal AG) than in the exonic region (alternative splicing at the distal AG), alternative AG sites located within the intronic region show stronger potential as the acceptor. These observations revealed that the choice of 3' splice sites during 3' splice-junction wobbling could depend on the distance between the duplicated AG and the branch point site (BPS). Further mutagenesis experiments demonstrated that the AG-to-AG and BPS-to-AG distances can greatly influence 3' splice site selection. Knocking down a known alternative splicing regulator, hSlu7, failed to affect wobble splicing choices. Conclusion: Our results imply that the nucleotide distance between the proximal and distal AG sites has an important regulatory function. In this study, we showed that 3' wobble splicing occurs in a distance-dependent manner and that most of this wobble splicing is probably caused by steric hindrance from a factor bound at the neighboring tandem motif sequence.
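As a small illustration of the 3'-NAGNAG tandem acceptor motif mentioned in this abstract, the sketch below scans a nucleotide string for every NAGNAG occurrence and reports the offsets of the proximal and distal AG. It is a toy pattern scan under simple assumptions (DNA alphabet A/C/G/T, N meaning any base), not the database analysis the authors performed, and the function name `find_nagnag` is hypothetical.

```python
import re

# N-A-G-N-A-G where N is any of A, C, G, T. The lookahead makes the scan
# report overlapping occurrences as well.
NAGNAG = re.compile(r"(?=([ACGT]AG[ACGT]AG))")


def find_nagnag(sequence):
    """Return a list of (start, motif, proximal_ag, distal_ag) tuples.

    start is the 0-based offset of the motif; proximal_ag and distal_ag are
    the 0-based offsets of the 'A' in the first and second AG dinucleotide.
    """
    seq = sequence.upper()
    hits = []
    for match in NAGNAG.finditer(seq):
        start = match.start()
        hits.append((start, match.group(1), start + 1, start + 4))
    return hits


# Made-up sequence for demonstration only.
for hit in find_nagnag("ttttcagcaggt"):
    print(hit)
```

The two AG positions in a hit are three nucleotides apart, which is why choosing one or the other adds or removes a single codon. Measuring the BPS-to-AG distance discussed in the abstract would additionally require a branch point annotation, which this sketch does not attempt.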

    BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature

    BACKGROUND: Text mining tools are becoming essential for automatically processing large quantities of biological literature for knowledge discovery and information curation. Abbreviation recognition is related to named entity recognition (NER) and can be considered a pair recognition task: identifying a term and its corresponding abbreviation in free text. The successful identification of an abbreviation and its corresponding definition is not only a prerequisite for indexing the terms of text databases to produce articles of related interest, but also a building block for improving existing gene mention tagging and gene normalization tools. RESULTS: Our approach to abbreviation recognition (AR) is based on machine learning and exploits a novel set of rich features to learn rules from training data. Tested on the AB3P corpus, our system achieved an F-score of 89.90% with 95.86% precision at 84.64% recall, higher than the result achieved by the best-performing existing AR system. We also annotated a new corpus of 1200 PubMed abstracts derived from the BioCreative II gene normalization corpus. On our annotated corpus, our system achieved an F-score of 86.20% with 93.52% precision at 79.95% recall, which also outperforms all tested systems. CONCLUSION: By applying our system to extract all short form-long form pairs from all available PubMed abstracts, we have constructed BIOADI. Mining BIOADI reveals many interesting trends in biomedical research. We also provide offline AR software in the download section at http://bioagent.iis.sinica.edu.tw/BIOADI/
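The F-scores reported in this abstract are consistent with the standard balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that definition (the abstract does not spell out which F-measure was used) and recomputes both figures from the stated precision and recall values; the function name `f_score` is just a label chosen for this sketch.

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
ab3p = f_score(0.9586, 0.8464)          # AB3P corpus
biocreative = f_score(0.9352, 0.7995)   # newly annotated 1200-abstract corpus

print(f"AB3P corpus:      {ab3p * 100:.2f}%")         # 89.90%
print(f"Annotated corpus: {biocreative * 100:.2f}%")  # 86.20%
```

Both recomputed values match the reported 89.90% and 86.20%, and they show how the lower recall pulls the F-score well below the precision in each case.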

    Impacts of Two Types of El Niño and La Niña Events on Typhoon Activity

    The HadISST (Hadley Centre Sea Ice and Sea Surface Temperature) dataset is used to define the years of El Niño, El Niño Modoki, and La Niña events and to determine the impacts of these events on typhoon activity. The results show that typhoons form farther east in El Niño years than in La Niña years, and much farther east in El Niño Modoki years. Typhoon lifetimes and track distances are longer, and typhoon intensity is stronger, in El Niño and El Niño Modoki years than in La Niña years. The Accumulated Cyclone Energy of typhoons is highly correlated with the Oceanic Niño Index, with a correlation coefficient of 0.79. We also find that typhoons anomalously decrease during El Niño years but increase during El Niño Modoki years. In addition, there are two types of El Niño Modoki, I and II. Typhoon intensity is stronger in El Niño Modoki I years than in El Niño Modoki II years. Furthermore, the centroid position of the Western Pacific Warm Pool is strongly related to the area of typhoon formation, with a correlation coefficient of 0.95.
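The abstract reports correlation coefficients without naming the statistic. Assuming they are Pearson correlation coefficients, which is an assumption rather than something the abstract states, the calculation can be sketched as follows. The function name `pearson_r` and the two short series are hypothetical placeholders, not observations from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)


# Placeholder numbers standing in for a yearly climate index and a yearly
# energy total. They are invented and carry no physical meaning.
index_values = [-1.0, -0.5, 0.0, 0.5, 1.0]
energy_values = [2.0, 1.0, 3.0, 4.0, 5.0]

print(round(pearson_r(index_values, energy_values), 2))
```

A value near 1 means the two series rise and fall together almost linearly, which is the sense in which 0.79 and 0.95 are described as high and strong in the abstract.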

    Soft tagging of overlapping high confidence gene mention variants for cross-species full-text gene normalization

    Background: Previous gene normalization (GN) systems mostly focused on disambiguation using contextual information. An effective gene mention tagger was deemed unnecessary because the subsequent steps would filter out false positives, so high recall was considered sufficient. However, unlike similar tasks in past BioCreative challenges, the BioCreative III GN task is particularly challenging because it is not species-specific. When full-length articles must be processed, an ineffective gene mention tagger may produce a huge number of ambiguous false positives that overwhelm subsequent filtering steps while still missing many true positives. Results: We present our GN system, which participated in the BioCreative III GN task. Our system applies a typical two-stage approach to GN but features a soft-tagging gene mention tagger that generates a set of overlapping gene mention variants with nearly perfect recall. The overlapping gene mention variants increase the chance of a precise match in the dictionary and reduce the need for disambiguation. Our GN system achieved a precision of 0.9 (F-score 0.63) on the BioCreative III GN test corpus with the silver annotation of 507 articles. Its TAP-k scores are competitive with the best results among all participants. Conclusions: We show that despite the lack of clever disambiguation in our gene normalization system, effective soft tagging of gene mention variants can indeed contribute to performance in cross-species and full-text gene normalization.

    UMARS: Un-MAppable Reads Solution

    Background: Un-MAppable Reads Solution (UMARS) is a user-friendly web service focused on retrieving valuable information from sequence reads that cannot be mapped back to reference genomes. Recently, next-generation sequencing (NGS) technology has emerged as a powerful tool for generating high-throughput sequencing data and has been applied to many kinds of biological research. In a typical analysis, adaptor-trimmed NGS reads are first mapped back to reference sequences, including genomes or transcripts. However, a fraction of NGS reads fail to map back to the reference sequences. Such un-mappable reads are usually attributed to sequencing errors and discarded without further consideration. Methods: We are investigating the possible biological relevance and possible sources of un-mappable reads. We therefore developed UMARS to scan un-mappable reads for viral genomic fragments or exon-exon junctions of novel alternative splicing isoforms. To map un-mappable reads, we first collected viral genomes and sequences of exon-exon junctions. We then constructed the UMARS pipeline as an automatic alignment interface. Results: We show the applicability of UMARS by demonstrating the results of two UMARS alignment cases. First, we showed that the expected EBV genomic fragments can be detected by UMARS. Second, we detected exon-exon junctions from un-mappable reads. Further experimental validation also confirmed the validity of the UMARS pipeline. The UMARS service is freely available to the academic community and can be accessed via http://musk.ibms.sinica.edu.tw/UMARS/. Conclusions: In this study, we have shown that some un-mappable reads are not caused by sequencing errors. They can originate from viral infection or transcript splicing. Our UMARS pipeline provides another way to examine and recycle the un-mappable reads that are commonly discarded as garbage.